library(tidyverse)
library(knitr)
library(broom)
library(stringr)
library(modelr)
library(forcats)
library(tidytext)
library(twitteR)
library(wordcloud)
library(scales)
options(digits = 3)
set.seed(1234)
theme_set(theme_minimal())
Text data can come from lots of areas:
The easier it is to convert your text data into digitally stored text, the cleaner your results and the fewer transcription errors you will have.
A text corpus is a large and structured set of texts. It typically stores the text as a raw character string, with metadata and other details stored alongside the text.
Examples of typical transformations include:
Feature extraction involves converting the text string into some sort of quantifiable measure. The most common approach is the bag-of-words model, whereby each document is represented as a vector which counts the frequency of each term’s appearance in the document. Combining the vectors for all documents creates a term-document matrix.
However, the bag-of-words model ignores context. You could randomly scramble the order of terms appearing in the document and still get the same term-document matrix.
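A minimal base-R sketch of this property. The tokenizer (a naive whitespace split) and the example sentence are assumptions made for illustration, not part of the analysis in these notes:

```r
# Count term frequencies for one document using a naive whitespace tokenizer
bag_of_words <- function(text) {
  tokens <- strsplit(tolower(text), "\\s+")[[1]]
  tokens <- tokens[tokens != ""]
  counts <- table(tokens)                      # terms come back sorted alphabetically
  setNames(as.integer(counts), names(counts))
}

doc <- "the dog chased the cat"

# Scramble the word order of the same document
set.seed(1234)
scrambled <- paste(sample(strsplit(doc, " ")[[1]]), collapse = " ")

original_counts <- bag_of_words(doc)
scrambled_counts <- bag_of_words(scrambled)

# The two count vectors match because word order is discarded
identical(original_counts, scrambled_counts)
```

Stacking one such vector per document, with a shared vocabulary, would give the term-document matrix described above.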
At this point you now have data assembled and ready for analysis. There are several approaches you may take when analyzing text depending on your research question. Basic approaches include:
More advanced methods include document classification, or assigning documents to different categories. This can be supervised (the potential categories are defined in advance of the modeling) or unsupervised (the potential categories are unknown prior to analysis). You might also conduct corpora comparison, or comparing the content of different groups of text. This is the approach used in plagiarism-detection software such as Turnitin. Finally, you may attempt to detect clusters of document features, known as topic modeling.
So far we’ve used basic plots from ggplot2 to visualize our text data. However we could also use a word cloud to represent our text data. Also known as a tag cloud, word clouds visually represent text data by weighting the importance of each word, typically based on frequency in the text document. We can use the wordcloud package in R to generate these plots based on our tidied text data.
To draw the wordcloud, we need the data in tidy text format, so one-row-per-term. For example, here is a wordcloud of a set of tweets related to #rstats:
library(twitteR)
# You'd need to set global options with an authenticated app
setup_twitter_oauth(getOption("twitter_api_key"),
                    getOption("twitter_api_token"))
## [1] "Using browser based authentication"
library(wordcloud)
# get tweets
reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))" # custom regular expression to tokenize tweets
rstats <- searchTwitter('#rstats', n = 3200) %>%
  twListToDF %>%
  as_tibble
# tokenize
rstats_token <- rstats %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))
# plot
rstats_token %>%
  count(word) %>%
  filter(word != "#rstats") %>%
  with(wordcloud(word, n, max.words = 100))
Or tweets by Pope Francis:
# get tweets
pope <- userTimeline("Pontifex", n = 3200) %>%
  twListToDF %>%
  as_tibble
# tokenize
reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))" # custom regular expression to tokenize tweets
pope_token <- pope %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))
# plot
pope_token %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
We can even use wordclouds to compare words or tokens through the comparison.cloud() function. For instance, how do the tweets by Donald Trump compare to those by Pope Francis? In order to make this work, we first need to convert our tidy data frame into a matrix using the acast() function from reshape2, then pass that matrix to comparison.cloud().
library(reshape2)
# get fresh trump tweets
trump <- userTimeline("realDonaldTrump", n = 3200) %>%
  twListToDF %>%
  as_tibble
# tokenize
trump_token <- trump %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))
bind_rows(Trump = trump_token, Pope = pope_token, .id = "person") %>%
  count(word, person) %>%
  acast(word ~ person, value.var = "n", fill = 0) %>%
  comparison.cloud(max.words = 100, colors = c("blue", "red"))
The size of a word’s text is in proportion to its frequency within its category (i.e. proportion of all Trump tweets or all pope tweets). We can use this visualization to see the most frequent words/hashtags by President Trump and Pope Francis, but the sizes of the words are not comparable across the two accounts.
An n-gram is a contiguous sequence of \(n\) items from a given sequence of text or speech.
This starts to incorporate context into our visualization. Rather than assuming all words/tokens are unique and independent from one another, n-grams of size 2 and up join together pairs or combinations of words in order to identify frequency within a document.
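As a rough illustration, here is a base-R sketch that builds n-grams from a single string. The helper name `make_ngrams()` and the example sentence are hypothetical; with tidytext, `unnest_tokens()` with `token = "ngrams"` offers a comparable tokenizer.

```r
# Build contiguous n-word sequences from one string (hypothetical helper)
make_ngrams <- function(text, n = 2) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words <- words[words != ""]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

bigrams <- make_ngrams("The quick brown fox", n = 2)
bigrams
# "the quick"   "quick brown" "brown fox"

# Counting n-grams then works exactly like counting single words
table(bigrams)
```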
Sentiment analysis uses text analysis to estimate the attitude of a speaker or writer with respect to some topic or the overall polarity of the document. For example, the sentence
I am happy
contains words and language typically associated with positive feelings and emotions. Therefore if someone tweeted “I am happy”, we could make an educated guess that the person is expressing positive feelings.
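The educated guess can be sketched as a dictionary lookup. The four-word lexicon and the `score_sentiment()` helper below are made up for illustration only; they are not one of the published lexicons.

```r
# Hypothetical mini-lexicon; real lexicons contain thousands of entries
toy_lexicon <- c(happy = "positive", love = "positive",
                 sad = "negative", terrible = "negative")

# Net sentiment: positive matches minus negative matches
score_sentiment <- function(text, lexicon = toy_lexicon) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  matched <- lexicon[words[words %in% names(lexicon)]]
  sum(matched == "positive") - sum(matched == "negative")
}

score_sentiment("I am happy")                # 1
score_sentiment("What a sad, terrible day")  # -2
```

Words missing from the lexicon contribute nothing to the score, which is why the coverage of the dictionary matters so much.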
Obviously it would be difficult for us to create a complete dictionary that classifies words based on their emotional affect; fortunately other scholars have already done this for us. Some simply classify words and terms as positive or negative:
get_sentiments("bing")
## # A tibble: 6,788 x 2
## word sentiment
## <chr> <chr>
## 1 2-faced negative
## 2 2-faces negative
## 3 a+ positive
## 4 abnormal negative
## 5 abolish negative
## 6 abominable negative
## 7 abominably negative
## 8 abominate negative
## 9 abomination negative
## 10 abort negative
## # ... with 6,778 more rows
Others rate them on a numeric scale:
get_sentiments("afinn")
## # A tibble: 2,476 x 2
## word score
## <chr> <int>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ... with 2,466 more rows
Still others rate words based on specific sentiments:
get_sentiments("nrc")
## # A tibble: 13,901 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ... with 13,891 more rows
get_sentiments("nrc") %>%
  count(sentiment)
## # A tibble: 10 x 2
## sentiment n
## <chr> <int>
## 1 anger 1247
## 2 anticipation 839
## 3 disgust 1058
## 4 fear 1476
## 5 joy 689
## 6 negative 3324
## 7 positive 2312
## 8 sadness 1191
## 9 surprise 534
## 10 trust 1231
In order to assess the document or speaker’s overall sentiment, you simply count up the number of words associated with each sentiment. For instance, how positive or negative are Jane Austen’s novels? We can determine this by counting up the number of positive and negative words in each 80-line section of text, like so:
library(janeaustenr)
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
janeaustensentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
ggplot(janeaustensentiment, aes(index, sentiment, fill = book)) +
  geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")
Ignoring the specific code, this is a relatively simple operation. Once you have the text converted into a format suitable for analysis, tabulating and counting term frequency is not a complicated operation.
Every non-hyperbolic tweet is from iPhone (his staff). Every hyperbolic tweet is from Android (from him). pic.twitter.com/GWr6D8h5ed
— Todd Vaziri (@tvaziri) August 6, 2016
If you want to know what President Donald Trump personally tweets from his account versus what his handlers tweet, it looks like we might have a way of detecting this difference. Tweets from an iPhone are from his staff; tweets from an Android are from him. Can we quantify this behavior or use text analysis to lend evidence to this argument? Yes.
library(twitteR)
# You'd need to set global options with an authenticated app
setup_twitter_oauth(getOption("twitter_api_key"),
                    getOption("twitter_api_token"))
## [1] "Using browser based authentication"
# We can request only 3200 tweets at a time; it will return fewer
# depending on the API
trump_tweets <- userTimeline("realDonaldTrump", n = 3200)
trump_tweets_df <- trump_tweets %>%
  map_df(as.data.frame) %>%
  as_tibble()
# if you want to follow along without setting up Twitter authentication,
# just use this dataset:
load(url("http://varianceexplained.org/files/trump_tweets_df.rda"))
str(trump_tweets_df)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1512 obs. of 16 variables:
## $ text : chr "My economic policy speech will be carried live at 12:15 P.M. Enjoy!" "Join me in Fayetteville, North Carolina tomorrow evening at 6pm. Tickets now available at: https://t.co/Z80d4MYIg8" "#ICYMI: \"Will Media Apologize to Trump?\" https://t.co/ia7rKBmioA" "Michael Morell, the lightweight former Acting Director of C.I.A., and a man who has made serious bad calls, is a total Clinton "| __truncated__ ...
## $ favorited : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ favoriteCount: num 9214 6981 15724 19837 34051 ...
## $ replyToSN : chr NA NA NA NA ...
## $ created : POSIXct, format: "2016-08-08 15:20:44" "2016-08-08 13:28:20" ...
## $ truncated : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ replyToSID : logi NA NA NA NA NA NA ...
## $ id : chr "762669882571980801" "762641595439190016" "762439658911338496" "762425371874557952" ...
## $ replyToUID : chr NA NA NA NA ...
## $ statusSource : chr "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>" "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>" "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>" "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>" ...
## $ screenName : chr "realDonaldTrump" "realDonaldTrump" "realDonaldTrump" "realDonaldTrump" ...
## $ retweetCount : num 3107 2390 6691 6402 11717 ...
## $ isRetweet : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ retweeted : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ longitude : chr NA NA NA NA ...
## $ latitude : chr NA NA NA NA ...
Let’s next clean up the data frame by selecting only the relevant columns, extracting from statusSource the name of the application used to generate the tweet, and filtering for only tweets from an iPhone or an Android phone. The extract() function uses a regular expression to extract the app name.
tweets <- trump_tweets_df %>%
  select(id, statusSource, text, created) %>%
  extract(statusSource, "source", "Twitter for (.*?)<") %>%
  filter(source %in% c("iPhone", "Android"))
tweets %>%
  head() %>%
  knitr::kable(caption = "Example of Donald Trump tweets")
| id | source | text | created |
|---|---|---|---|
| 762669882571980801 | Android | My economic policy speech will be carried live at 12:15 P.M. Enjoy! | 2016-08-08 15:20:44 |
| 762641595439190016 | iPhone | Join me in Fayetteville, North Carolina tomorrow evening at 6pm. Tickets now available at: https://t.co/Z80d4MYIg8 | 2016-08-08 13:28:20 |
| 762439658911338496 | iPhone | #ICYMI: “Will Media Apologize to Trump?” https://t.co/ia7rKBmioA | 2016-08-08 00:05:54 |
| 762425371874557952 | Android | Michael Morell, the lightweight former Acting Director of C.I.A., and a man who has made serious bad calls, is a total Clinton flunky! | 2016-08-07 23:09:08 |
| 762400869858115588 | Android | The media is going crazy. They totally distort so many things on purpose. Crimea, nuclear, “the baby” and so much more. Very dishonest! | 2016-08-07 21:31:46 |
| 762284533341417472 | Android | I see where Mayor Stephanie Rawlings-Blake of Baltimore is pushing Crooked hard. Look at the job she has done in Baltimore. She is a joke! | 2016-08-07 13:49:29 |
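The regular expression passed to extract() can be checked in isolation. This sketch uses base R's `sub()` as a stand-in for `extract()` (an assumption made to keep the example self-contained) and applies the same pattern to two statusSource values copied from the str() output above:

```r
# Two statusSource values as they appear in the raw data
status_source <- c(
  "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>",
  "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>"
)

# Capture the shortest run of characters between "Twitter for " and the next "<"
app <- sub(".*Twitter for (.*?)<.*", "\\1", status_source, perl = TRUE)
app
# "Android" "iPhone"
```

The lazy quantifier `.*?` stops at the first `<`, so the closing `</a>` tag is not included in the captured name.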
What can we say about the difference in the content? We can use the tidytext package to analyze this.
We start by dividing the tweets into individual words using the unnest_tokens() function and removing some common stopwords. This is a common step in preparing text for analysis. Typically, tokens are single words from a document. However, they can also be bi-grams (pairs of words), tri-grams (three-word sequences), n-grams (\(n\)-length sequences of words), or, in this case, individual words, hashtags, or references to other Twitter users. Because tweets are a special form of text (they can include words, URLs, references to other users, hashtags, etc.), we need to use a custom regular expression to convert the text into tokens.
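To see what the custom pattern does, here is a base-R sketch that splits a single string on it. The `tokenize_tweet()` helper and the example tweet are hypothetical, and `strsplit()` stands in for `unnest_tokens()`, so treat it as an approximation of the tokenizer rather than the tidytext implementation.

```r
# Split on any character that is not a letter, digit, #, @, or apostrophe,
# and on apostrophes that are not followed by one of those characters
reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"

tokenize_tweet <- function(text) {
  tokens <- strsplit(tolower(text), reg, perl = TRUE)[[1]]  # lookahead needs perl = TRUE
  tokens[tokens != ""]                                      # drop empty strings between delimiters
}

tokenize_tweet("Crooked Hillary! #Trump2016 @CNN don't")
# "crooked" "hillary" "#trump2016" "@cnn" "don't"
```

Hashtags, mentions, and contractions survive as single tokens, which a plain word tokenizer would typically break apart.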
library(tidytext)
reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))" # custom regular expression to tokenize tweets
# function to neatly print the first 6 rows using kable
print_neat <- function(df){
  df %>%
    head() %>%
    knitr::kable()
}
# tweets data frame
tweets %>%
  print_neat()
| id | source | text | created |
|---|---|---|---|
| 762669882571980801 | Android | My economic policy speech will be carried live at 12:15 P.M. Enjoy! | 2016-08-08 15:20:44 |
| 762641595439190016 | iPhone | Join me in Fayetteville, North Carolina tomorrow evening at 6pm. Tickets now available at: https://t.co/Z80d4MYIg8 | 2016-08-08 13:28:20 |
| 762439658911338496 | iPhone | #ICYMI: “Will Media Apologize to Trump?” https://t.co/ia7rKBmioA | 2016-08-08 00:05:54 |
| 762425371874557952 | Android | Michael Morell, the lightweight former Acting Director of C.I.A., and a man who has made serious bad calls, is a total Clinton flunky! | 2016-08-07 23:09:08 |
| 762400869858115588 | Android | The media is going crazy. They totally distort so many things on purpose. Crimea, nuclear, “the baby” and so much more. Very dishonest! | 2016-08-07 21:31:46 |
| 762284533341417472 | Android | I see where Mayor Stephanie Rawlings-Blake of Baltimore is pushing Crooked hard. Look at the job she has done in Baltimore. She is a joke! | 2016-08-07 13:49:29 |
# remove manual retweets
tweets %>%
  filter(!str_detect(text, '^"')) %>%
  print_neat()
| id | source | text | created |
|---|---|---|---|
| 762669882571980801 | Android | My economic policy speech will be carried live at 12:15 P.M. Enjoy! | 2016-08-08 15:20:44 |
| 762641595439190016 | iPhone | Join me in Fayetteville, North Carolina tomorrow evening at 6pm. Tickets now available at: https://t.co/Z80d4MYIg8 | 2016-08-08 13:28:20 |
| 762439658911338496 | iPhone | #ICYMI: “Will Media Apologize to Trump?” https://t.co/ia7rKBmioA | 2016-08-08 00:05:54 |
| 762425371874557952 | Android | Michael Morell, the lightweight former Acting Director of C.I.A., and a man who has made serious bad calls, is a total Clinton flunky! | 2016-08-07 23:09:08 |
| 762400869858115588 | Android | The media is going crazy. They totally distort so many things on purpose. Crimea, nuclear, “the baby” and so much more. Very dishonest! | 2016-08-07 21:31:46 |
| 762284533341417472 | Android | I see where Mayor Stephanie Rawlings-Blake of Baltimore is pushing Crooked hard. Look at the job she has done in Baltimore. She is a joke! | 2016-08-07 13:49:29 |
# remove urls
tweets %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  print_neat()
| id | source | text | created |
|---|---|---|---|
| 762669882571980801 | Android | My economic policy speech will be carried live at 12:15 P.M. Enjoy! | 2016-08-08 15:20:44 |
| 762641595439190016 | iPhone | Join me in Fayetteville, North Carolina tomorrow evening at 6pm. Tickets now available at: | 2016-08-08 13:28:20 |
| 762439658911338496 | iPhone | #ICYMI: “Will Media Apologize to Trump?” | 2016-08-08 00:05:54 |
| 762425371874557952 | Android | Michael Morell, the lightweight former Acting Director of C.I.A., and a man who has made serious bad calls, is a total Clinton flunky! | 2016-08-07 23:09:08 |
| 762400869858115588 | Android | The media is going crazy. They totally distort so many things on purpose. Crimea, nuclear, “the baby” and so much more. Very dishonest! | 2016-08-07 21:31:46 |
| 762284533341417472 | Android | I see where Mayor Stephanie Rawlings-Blake of Baltimore is pushing Crooked hard. Look at the job she has done in Baltimore. She is a joke! | 2016-08-07 13:49:29 |
# unnest into tokens - tidytext format
tweets %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  print_neat()
| id | source | created | word |
|---|---|---|---|
| 676494179216805888 | iPhone | 2015-12-14 20:09:15 | record |
| 676494179216805888 | iPhone | 2015-12-14 20:09:15 | of |
| 676494179216805888 | iPhone | 2015-12-14 20:09:15 | health |
| 676494179216805888 | iPhone | 2015-12-14 20:09:15 | #makeamericagreatagain |
| 676494179216805888 | iPhone | 2015-12-14 20:09:15 | #trump2016 |
| 676509769562251264 | iPhone | 2015-12-14 21:11:12 | another |
# remove stop words
tweets %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]")) %>%
  print_neat()
| id | source | created | word |
|---|---|---|---|
| 676494179216805888 | iPhone | 2015-12-14 20:09:15 | record |
| 676494179216805888 | iPhone | 2015-12-14 20:09:15 | health |
| 676494179216805888 | iPhone | 2015-12-14 20:09:15 | #makeamericagreatagain |
| 676494179216805888 | iPhone | 2015-12-14 20:09:15 | #trump2016 |
| 676509769562251264 | iPhone | 2015-12-14 21:11:12 | accolade |
| 676509769562251264 | iPhone | 2015-12-14 21:11:12 | @trumpgolf |
# store for future use
tweet_words <- tweets %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))
What were the most common words in Trump’s tweets overall?
Yeah, sounds about right.
One measure of how important a word may be is its term frequency (tf), how frequently a word occurs within a document. The problem with this approach is that some words occur many times in a document, yet are probably not important (e.g. “the”, “is”, “of”). Instead, we want a way of downweighting words that are common across all documents, and upweighting words that are frequent within a small set of documents.
Another approach is to look at a term’s inverse document frequency (idf), which decreases the weight for commonly used words and increases the weight for words that are not used very much in a collection of documents. This can be combined with term frequency to calculate a term’s tf-idf, the frequency of a term adjusted for how rarely it is used. It is intended to measure how important a word is to a document in a collection (or corpus) of documents. It is a rule-of-thumb or heuristic quantity, not a theoretically proven method. The inverse document frequency for any given term is defined as:
\[idf(\text{term}) = \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)}\]
To calculate tf-idf for this set of documents, we will pool all the tweets from iPhone and Android together and treat them as if they are two total documents. Then we can calculate the frequency of terms in each group, and standardize that relative to the term’s frequency across the entire corpus.
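Before running this on the full data, the arithmetic can be worked through by hand in base R. The sketch below uses four word counts and the two per-source totals reported in this section; restricting attention to these three words is a simplification made for illustration.

```r
# Word counts by source, taken from the output in this section
counts <- data.frame(
  source = c("iPhone", "iPhone", "Android", "Android"),
  word   = c("join", "hillary", "badly", "hillary"),
  n      = c(42, 52, 13, 124),
  stringsAsFactors = FALSE
)
# Total tokens per source after stop-word removal, also from this section
totals <- c(iPhone = 3852, Android = 4901)

# tf: share of a source's words accounted for by the term
counts$tf <- counts$n / unname(totals[counts$source])

# idf: ln(number of documents / number of documents containing the term)
n_docs <- length(totals)
docs_with_term <- ave(counts$n, counts$word, FUN = length)
counts$idf <- log(n_docs / docs_with_term)

counts$tf_idf <- counts$tf * counts$idf
counts
```

With only two documents, idf can take just two values: ln(2/2) = 0 for a word used on both devices (such as "hillary") and ln(2/1) ≈ 0.693 for a word used on only one. A nonzero tf-idf here therefore flags words that are exclusive to one device.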
tweet_words_count <- tweet_words %>%
  count(source, word, sort = TRUE) %>%
  ungroup()
tweet_words_count
## # A tibble: 3,235 x 3
## source word n
## <chr> <chr> <int>
## 1 iPhone #trump2016 171
## 2 Android hillary 124
## 3 iPhone #makeamericagreatagain 95
## 4 Android crooked 93
## 5 Android clinton 66
## 6 Android people 64
## 7 iPhone hillary 52
## 8 Android cruz 50
## 9 Android bad 43
## 10 iPhone america 43
## # ... with 3,225 more rows
total_words <- tweet_words_count %>%
  group_by(source) %>%
  summarize(total = sum(n))
total_words
## # A tibble: 2 x 2
## source total
## <chr> <int>
## 1 Android 4901
## 2 iPhone 3852
tweet_words_count <- left_join(tweet_words_count, total_words)
tweet_words_count
## # A tibble: 3,235 x 4
## source word n total
## <chr> <chr> <int> <int>
## 1 iPhone #trump2016 171 3852
## 2 Android hillary 124 4901
## 3 iPhone #makeamericagreatagain 95 3852
## 4 Android crooked 93 4901
## 5 Android clinton 66 4901
## 6 Android people 64 4901
## 7 iPhone hillary 52 3852
## 8 Android cruz 50 4901
## 9 Android bad 43 4901
## 10 iPhone america 43 3852
## # ... with 3,225 more rows
tweet_words_count <- tweet_words_count %>%
  bind_tf_idf(word, source, n)
tweet_words_count
## # A tibble: 3,235 x 7
## source word n total tf idf tf_idf
## <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 iPhone #trump2016 171 3852 0.04439 0.000 0.0000
## 2 Android hillary 124 4901 0.02530 0.000 0.0000
## 3 iPhone #makeamericagreatagain 95 3852 0.02466 0.693 0.0171
## 4 Android crooked 93 4901 0.01898 0.000 0.0000
## 5 Android clinton 66 4901 0.01347 0.000 0.0000
## 6 Android people 64 4901 0.01306 0.000 0.0000
## 7 iPhone hillary 52 3852 0.01350 0.000 0.0000
## 8 Android cruz 50 4901 0.01020 0.000 0.0000
## 9 Android bad 43 4901 0.00877 0.000 0.0000
## 10 iPhone america 43 3852 0.01116 0.000 0.0000
## # ... with 3,225 more rows
Which terms have a high tf-idf?
tweet_words_count %>%
  select(-total) %>%
  arrange(desc(tf_idf))
## # A tibble: 3,235 x 6
## source word n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 iPhone #makeamericagreatagain 95 0.02466 0.693 0.01709
## 2 iPhone join 42 0.01090 0.693 0.00756
## 3 iPhone #americafirst 27 0.00701 0.693 0.00486
## 4 iPhone #votetrump 23 0.00597 0.693 0.00414
## 5 iPhone #imwithyou 20 0.00519 0.693 0.00360
## 6 iPhone #crookedhillary 17 0.00441 0.693 0.00306
## 7 iPhone #trumppence16 14 0.00363 0.693 0.00252
## 8 iPhone 7pm 11 0.00286 0.693 0.00198
## 9 iPhone video 11 0.00286 0.693 0.00198
## 10 Android badly 13 0.00265 0.693 0.00184
## # ... with 3,225 more rows
tweet_important <- tweet_words_count %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word))))
tweet_important %>%
  group_by(source) %>%
  slice(1:15) %>%
  ggplot(aes(word, tf_idf, fill = source)) +
  geom_bar(alpha = 0.8, stat = "identity") +
  labs(title = "Highest tf-idf words in @realDonaldTrump",
       subtitle = "Top 15 for Android and iPhone",
       x = NULL, y = "tf-idf") +
  coord_flip()
Most hashtags come from the iPhone. Indeed, almost no tweets from Trump’s Android contained hashtags, with some rare exceptions like this one. (This is true only because we filtered out the quoted “retweets”, as Trump does sometimes quote tweets like this that contain hashtags).
Words like “join”, and times like “7pm”, also came only from the iPhone. The iPhone is clearly responsible for event announcements like this one (“Join me in Houston, Texas tomorrow night at 7pm!”)
A lot of “emotionally charged” words, like “badly” and “dumb”, were overwhelmingly more common on Android. This supports the original hypothesis that this is the “angrier” or more hyperbolic account.
Since we’ve observed a difference in sentiment between the Android and iPhone tweets, let’s try quantifying it. We’ll work with the NRC Word-Emotion Association lexicon, available from the tidytext package, which associates words with 10 sentiments: positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.
nrc <- get_sentiments("nrc")
nrc
## # A tibble: 13,901 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ... with 13,891 more rows
To measure the sentiment of the Android and iPhone tweets, we can count the number of words in each category:
sources <- tweet_words %>%
  group_by(source) %>%
  mutate(total_words = n()) %>%
  ungroup() %>%
  distinct(id, source, total_words)
sources
## # A tibble: 1,172 x 3
## id source total_words
## <chr> <chr> <int>
## 1 676494179216805888 iPhone 3852
## 2 676509769562251264 iPhone 3852
## 3 680496083072593920 Android 4901
## 4 680503951440121856 Android 4901
## 5 680505672476262400 Android 4901
## 6 680734915718176768 Android 4901
## 7 682764544402440192 iPhone 3852
## 8 682792967736848385 iPhone 3852
## 9 682805320217980929 iPhone 3852
## 10 685490467329425408 Android 4901
## # ... with 1,162 more rows
by_source_sentiment <- tweet_words %>%
  inner_join(nrc, by = "word") %>%
  count(sentiment, id) %>%
  ungroup() %>%
  complete(sentiment, id, fill = list(n = 0)) %>%
  inner_join(sources) %>%
  group_by(source, sentiment, total_words) %>%
  summarize(words = sum(n)) %>%
  ungroup()
head(by_source_sentiment)
## # A tibble: 6 x 4
## source sentiment total_words words
## <chr> <chr> <int> <dbl>
## 1 Android anger 4901 321
## 2 Android anticipation 4901 256
## 3 Android disgust 4901 207
## 4 Android fear 4901 268
## 5 Android joy 4901 199
## 6 Android negative 4901 560
(For example, we see that 321 of the 4901 words in the Android tweets were associated with “anger”). We then want to measure how much more likely the Android account is to use an emotionally-charged term relative to the iPhone account. Since this is count data, we can use a Poisson test to measure the difference:
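A small sketch of what the two-sample Poisson test reports. The counts below are hypothetical round numbers chosen so the rate ratio is easy to verify by hand; they are not taken from the tweet data.

```r
# Hypothetical counts: sentiment-bearing words and total words in two groups
sentiment_words <- c(30, 15)
total_words <- c(1000, 1000)

# Compares the rates 30/1000 and 15/1000
result <- poisson.test(sentiment_words, total_words)

result$estimate   # rate ratio of the first group relative to the second: 2
result$conf.int   # 95% confidence interval for that ratio
result$p.value    # test of the null hypothesis that the ratio equals 1
```

An estimate of 2 means the first group uses such words at twice the rate of the second, which corresponds to a "100% increase" once 1 is subtracted from the ratio.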
# function to calculate the poisson.test for a given sentiment
poisson_test <- function(df){
  poisson.test(df$words, df$total_words)
}
# use the nest() and map() functions to apply poisson_test to each sentiment and
# extract results using broom::tidy()
sentiment_differences <- by_source_sentiment %>%
  group_by(sentiment) %>%
  nest() %>%
  mutate(poisson = map(data, poisson_test),
         poisson_tidy = map(poisson, tidy)) %>%
  select(sentiment, poisson_tidy) %>%
  unnest(poisson_tidy)
sentiment_differences
## # A tibble: 10 x 9
## sentiment estimate statistic p.value parameter conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 anger 1.49 321 2.19e-05 274 1.235 1.81
## 2 anticipation 1.17 256 1.19e-01 240 0.960 1.43
## 3 disgust 1.68 207 1.78e-05 170 1.312 2.16
## 4 fear 1.56 268 1.89e-05 226 1.264 1.93
## 5 joy 1.00 199 1.00e+00 199 0.809 1.24
## 6 negative 1.69 560 7.09e-13 459 1.459 1.97
## 7 positive 1.06 555 3.82e-01 541 0.930 1.21
## 8 sadness 1.62 303 1.15e-06 252 1.326 1.99
## 9 surprise 1.17 159 2.17e-01 149 0.908 1.51
## 10 trust 1.13 369 1.47e-01 351 0.960 1.33
## # ... with 2 more variables: method <fctr>, alternative <fctr>
And we can visualize it with a 95% confidence interval:
sentiment_differences %>%
  ungroup() %>%
  mutate(sentiment = reorder(sentiment, estimate)) %>%
  mutate(across(c(estimate, conf.low, conf.high), ~ .x - 1)) %>%
  ggplot(aes(estimate, sentiment)) +
  geom_point() +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high)) +
  scale_x_continuous(labels = percent_format()) +
  labs(x = "% increase in Android relative to iPhone",
       y = "Sentiment")
Thus, Trump’s Android account uses about 40-80% more words related to disgust, sadness, fear, anger, and other “negative” sentiments than the iPhone account does. (The positive emotions weren’t different to a statistically significant extent).
We’re especially interested in which words drove this difference in sentiment. Let’s consider the words with the largest changes within each category:
tweet_important %>%
  inner_join(nrc, by = "word") %>%
  filter(!sentiment %in% c("positive", "negative")) %>%
  mutate(sentiment = reorder(sentiment, -tf_idf),
         word = reorder(word, -tf_idf)) %>%
  group_by(sentiment) %>%
  top_n(10, tf_idf) %>%
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = source)) +
  facet_wrap(~ sentiment, scales = "free", nrow = 4) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(x = "",
       y = "tf-idf") +
  scale_fill_manual(name = "", labels = c("Android", "iPhone"),
                    values = c("red", "lightblue"))
This confirms that lots of words annotated as negative sentiments are more common in Trump’s Android tweets than the campaign’s iPhone tweets. It’s no wonder Trump’s staff took away his tweeting privileges for the remainder of the campaign.
Text documents can be utilized in computational text analysis under the bag of words approach.1 Documents are represented as vectors, and each variable counts the frequency a word appears in a given document. While we throw away information such as word order, we can represent the information in a mathematical fashion using a matrix. Each row represents a single document, and each column is a different word:
| a | abandoned | abc | ability | able | about | above | abroad | absorbed | absorbing | abstract |
|---|---|---|---|---|---|---|---|---|---|---|
| 43 | 0 | 0 | 0 | 0 | 10 | 0 | 0 | 0 | 0 | 1 |
These vectors can be very large depending on the dictionary, or the number of unique words in the dataset. These bag-of-words vectors have three important properties:
Considering these three properties, we probably don’t need to keep all of the words. Instead, we could reduce the dimensionality of the data by projecting the larger dataset into a smaller feature space with fewer dimensions that summarize most of the variation in the data. Each dimension would represent a set of correlated words.
In a textual context, this process is known as latent semantic analysis. By identifying words that are closely related to one another, when searching for just one of the terms we can find documents that use not only that specific term but other similar ones. Think about how you search for information online. You normally identify one or more keywords, and search for pages that are related to those words. But search engines use techniques such as LSA to retrieve results not only for pages that use your exact word(s), but also pages that use similar or related words.
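A toy sketch of the idea in base R. The four-document, four-word matrix is invented for illustration: two "art" documents and two "music" documents whose vocabularies do not overlap.

```r
# Invented document-term matrix: rows are documents, columns are word counts
dtm <- matrix(
  c(3, 2, 0, 0,
    2, 3, 0, 0,
    0, 0, 3, 2,
    0, 0, 2, 3),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("art1", "art2", "music1", "music2"),
                  c("painting", "gallery", "orchestra", "opera"))
)

pca <- prcomp(dtm)

# Scores on the first component: art and music documents get opposite signs
# (which group is positive is arbitrary)
pc1_scores <- pca$x[, "PC1"]
pc1_scores

# Share of the total variance captured by the first component
pc1_share <- pca$sdev[1]^2 / sum(pca$sdev^2)
pc1_share
```

A single dimension summarizes most of the variation because the correlated words ("painting" with "gallery", "orchestra" with "opera") collapse onto one axis.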
NYTimes
# get NYTimes data
load("data/pca-examples.Rdata")
Let’s look at an application of LSA. nyt.frame contains a document-term matrix of a random sample of stories from the New York Times: 57 stories are about art, and 45 are about music. The first column identifies the topic of the article, and each remaining cell contains a frequency count of the number of times each word appeared in that article.2 The resulting data frame contains 102 rows and 4432 columns.
Some examples of words appearing in these articles:
colnames(nyt.frame)[sample(ncol(nyt.frame),30)]
## [1] "penchant" "brought" "structure" "willing" "yielding"
## [6] "bare" "school" "halls" "challenge" "step"
## [11] "largest" "lovers" "intense" "borders" "mall"
## [16] "classic" "conducted" "mirrors" "hole" "location"
## [21] "desperate" "published" "head" "paints" "another"
## [26] "starts" "familiar" "window" "thats" "broker"
We can estimate the LSA using the standard PCA procedure:
# Omit the first column of class labels
nyt.pca <- prcomp(nyt.frame[,-1])
# Extract the actual component directions/weights for ease of reference
nyt.latent.sem <- nyt.pca$rotation
# convert to data frame
nyt.latent.sem <- nyt.latent.sem %>%
  as_tibble %>%
  mutate(word = names(nyt.latent.sem[,1])) %>%
  select(word, everything())
Let’s extract the biggest components for the first principal component:
nyt.latent.sem %>%
  select(word, PC1) %>%
  arrange(PC1) %>%
  # keep the 10 most negative and the 10 most positive loadings
  slice(c(1:10, (n() - 9):n())) %>%
  mutate(pos = PC1 > 0,
         word = fct_reorder(word, PC1)) %>%
  ggplot(aes(word, PC1, fill = pos)) +
  geom_col() +
  labs(title = "LSA analysis of NYTimes articles",
       x = NULL,
       y = "PC1 scores") +
  coord_flip() +
  theme(legend.position = "none")
These are the 10 words with the largest positive loadings and the 10 words with the largest negative loadings on the first principal component. Words with positive loadings seem associated with music, whereas words with negative loadings are more strongly associated with art.
nyt.latent.sem %>%
  select(word, PC2) %>%
  arrange(PC2) %>%
  # keep the 10 most negative and the 10 most positive loadings
  slice(c(1:10, (n() - 9):n())) %>%
  mutate(pos = PC2 > 0,
         word = fct_reorder(word, PC2)) %>%
  ggplot(aes(word, PC2, fill = pos)) +
  geom_col() +
  labs(title = "LSA analysis of NYTimes articles",
       x = NULL,
       y = "PC2 scores") +
  coord_flip() +
  theme(legend.position = "none")
Here the positive words are about art, but more focused on acquiring and trading (“donations”, “tax”). We could perform a similar analysis on each of the 102 principal components (prcomp() returns at most as many components as there are rows in the data), but since the point of LSA/PCA is to reduce the dimensionality of the data, let’s just focus on the first two for now.
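One way to judge how many components are worth examining is the proportion of variance each one explains. The sketch below uses a simulated matrix (sim_dtm, a hypothetical stand-in for nyt.frame[, -1]) so that it runs on its own. With the real data, the same calculation would presumably be applied to nyt.pca$sdev.

```r
set.seed(1234)

# Hypothetical stand-in for a document-term matrix: 20 documents, 50 terms
sim_dtm <- matrix(rpois(20 * 50, lambda = 2), nrow = 20)
sim_pca <- prcomp(sim_dtm)

# The variance of each component is its squared standard deviation
pve <- sim_pca$sdev^2 / sum(sim_pca$sdev^2)

# prcomp() returns min(number of documents, number of terms) components
length(pve)

# Cumulative share of variance captured by the first five components
head(cumsum(pve), 5)
```

If the cumulative share levels off after a handful of components, that supports looking at only the first few.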
biplot(nyt.pca, scale = 0, cex = .6)
cbind(type = nyt.frame$class.labels, as_tibble(nyt.pca$x[, 1:2])) %>%
  mutate(type = factor(type, levels = c("art", "music"),
                       labels = c("A", "M"))) %>%
  ggplot(aes(PC1, PC2, label = type, color = type)) +
  geom_text() +
  labs(title = "") +
  theme(legend.position = "none")
The biplot looks a bit ridiculous because there are 4431 word variables to map onto the principal components, and only a few are interpretable. If we instead consider just the articles themselves, the first two principal components still strongly distinguish the two types of articles, even after throwing away the vast majority of the information in the original data set. If we wanted to use PCA to reduce the dimensionality of the data and predict an article’s topic using a method such as SVM, we could probably build a pretty good model using just the first two principal components rather than all of the individual variables (words).
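To make the last point concrete, here is a minimal sketch of classifying documents from their first two principal component scores. It is not the SVM mentioned above: to stay within base R it swaps in a simple nearest-centroid rule, and it runs on simulated data (sim_counts, topic, and the term rates are all hypothetical), so it illustrates the workflow rather than a result for the NYTimes articles.

```r
set.seed(1234)

# Hypothetical data: 40 documents, 30 terms, two topics with different term rates
n_per_class <- 20
rates_art   <- c(rep(4, 15), rep(1, 15))
rates_music <- c(rep(1, 15), rep(4, 15))
sim_counts <- rbind(
  matrix(rpois(n_per_class * 30, rates_art),   ncol = 30, byrow = TRUE),
  matrix(rpois(n_per_class * 30, rates_music), ncol = 30, byrow = TRUE)
)
topic <- factor(rep(c("art", "music"), each = n_per_class))

# Keep only the first two principal component scores
scores <- prcomp(sim_counts)$x[, 1:2]

# Nearest-centroid rule: assign each document to the topic whose mean score is closest
centroids <- rbind(art   = colMeans(scores[topic == "art", ]),
                   music = colMeans(scores[topic == "music", ]))
sq_dist <- sapply(rownames(centroids), function(k) {
  rowSums(sweep(scores, 2, centroids[k, ])^2)
})
predicted <- factor(colnames(sq_dist)[apply(sq_dist, 1, which.min)],
                    levels = levels(topic))

# In-sample accuracy (optimistic; a held-out set would give a fairer estimate)
accuracy <- mean(predicted == topic)
accuracy
```

The classifier only ever sees two columns, which is the sense in which the first two components can stand in for thousands of word variables.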
devtools::session_info()
## setting value
## version R version 3.3.3 (2017-03-06)
## system x86_64, darwin13.4.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## tz America/Chicago
## date 2017-05-22
##
## package * version date source
## assertthat 0.2.0 2017-04-11 cran (@0.2.0)
## backports 1.0.5 2017-01-18 CRAN (R 3.3.2)
## base * 3.3.3 2017-03-07 local
## bit 1.1-12 2014-04-09 CRAN (R 3.3.0)
## bit64 0.9-7 2017-05-08 CRAN (R 3.3.2)
## broom * 0.4.2 2017-02-13 CRAN (R 3.3.2)
## cellranger 1.1.0 2016-07-27 CRAN (R 3.3.0)
## colorspace 1.3-2 2016-12-14 CRAN (R 3.3.2)
## curl 2.6 2017-04-27 CRAN (R 3.3.2)
## datasets * 3.3.3 2017-03-07 local
## DBI 0.6-1 2017-04-01 CRAN (R 3.3.2)
## devtools 1.13.0 2017-05-08 CRAN (R 3.3.2)
## digest 0.6.12 2017-01-27 CRAN (R 3.3.2)
## dplyr * 0.5.0 2016-06-24 CRAN (R 3.3.0)
## evaluate 0.10 2016-10-11 CRAN (R 3.3.0)
## forcats * 0.2.0 2017-01-23 CRAN (R 3.3.2)
## foreign 0.8-68 2017-04-24 CRAN (R 3.3.2)
## ggplot2 * 2.2.1.9000 2017-05-12 Github (tidyverse/ggplot2@f4398b6)
## graphics * 3.3.3 2017-03-07 local
## grDevices * 3.3.3 2017-03-07 local
## grid 3.3.3 2017-03-07 local
## gtable 0.2.0 2016-02-26 CRAN (R 3.3.0)
## haven 1.0.0 2016-09-23 cran (@1.0.0)
## hms 0.3 2016-11-22 CRAN (R 3.3.2)
## htmltools 0.3.6 2017-04-28 cran (@0.3.6)
## httr 1.2.1 2016-07-03 CRAN (R 3.3.0)
## janeaustenr 0.1.4 2016-10-26 CRAN (R 3.3.0)
## jsonlite 1.4 2017-04-08 cran (@1.4)
## knitr * 1.15.1 2016-11-22 cran (@1.15.1)
## lattice 0.20-35 2017-03-25 CRAN (R 3.3.2)
## lazyeval 0.2.0 2016-06-12 CRAN (R 3.3.0)
## lubridate 1.6.0 2016-09-13 CRAN (R 3.3.0)
## magrittr 1.5 2014-11-22 CRAN (R 3.3.0)
## Matrix 1.2-10 2017-04-28 CRAN (R 3.3.2)
## memoise 1.1.0 2017-04-21 CRAN (R 3.3.2)
## methods * 3.3.3 2017-03-07 local
## mnormt 1.5-5 2016-10-15 CRAN (R 3.3.0)
## modelr * 0.1.0 2016-08-31 CRAN (R 3.3.0)
## munsell 0.4.3 2016-02-13 CRAN (R 3.3.0)
## nlme 3.1-131 2017-02-06 CRAN (R 3.3.3)
## openssl 0.9.6 2016-12-31 CRAN (R 3.3.2)
## parallel 3.3.3 2017-03-07 local
## plyr 1.8.4 2016-06-08 CRAN (R 3.3.0)
## psych 1.7.5 2017-05-03 CRAN (R 3.3.3)
## purrr * 0.2.2.2 2017-05-11 CRAN (R 3.3.3)
## R6 2.2.1 2017-05-10 CRAN (R 3.3.2)
## RColorBrewer * 1.1-2 2014-12-07 CRAN (R 3.3.0)
## Rcpp 0.12.10 2017-03-19 cran (@0.12.10)
## readr * 1.1.0 2017-03-22 cran (@1.1.0)
## readxl 1.0.0 2017-04-18 CRAN (R 3.3.2)
## reshape2 1.4.2 2016-10-22 CRAN (R 3.3.0)
## rjson 0.2.15 2014-11-03 cran (@0.2.15)
## rlang 0.1.9000 2017-05-12 Github (hadley/rlang@c17568e)
## rmarkdown 1.5 2017-04-26 CRAN (R 3.3.2)
## rprojroot 1.2 2017-01-16 CRAN (R 3.3.2)
## rvest 0.3.2 2016-06-17 CRAN (R 3.3.0)
## scales * 0.4.1 2016-11-09 CRAN (R 3.3.1)
## slam 0.1-40 2016-12-01 CRAN (R 3.3.2)
## SnowballC 0.5.1 2014-08-09 cran (@0.5.1)
## stats * 3.3.3 2017-03-07 local
## stringi 1.1.5 2017-04-07 CRAN (R 3.3.2)
## stringr * 1.2.0 2017-02-18 CRAN (R 3.3.2)
## tibble * 1.3.0.9002 2017-05-12 Github (tidyverse/tibble@9103a30)
## tidyr * 0.6.2 2017-05-04 CRAN (R 3.3.2)
## tidytext * 0.1.2 2016-10-28 CRAN (R 3.3.0)
## tidyverse * 1.1.1 2017-01-27 CRAN (R 3.3.2)
## tokenizers 0.1.4 2016-08-29 CRAN (R 3.3.0)
## tools 3.3.3 2017-03-07 local
## twitteR * 1.1.9 2015-07-29 CRAN (R 3.3.0)
## utils * 3.3.3 2017-03-07 local
## withr 1.0.2 2016-06-20 CRAN (R 3.3.0)
## wordcloud * 2.5 2014-06-13 CRAN (R 3.3.0)
## xml2 1.1.1 2017-01-24 CRAN (R 3.3.2)
## yaml 2.1.14 2016-11-12 cran (@2.1.14)
[1] This section is drawn from 18.3 in “Principal Component Analysis”.
[2] Actually it contains the term frequency-inverse document frequency, which downweights words that appear frequently across many documents. This is one method for guarding against any biases caused by stop words.